Binary Tree based Chinese Word Segmentation

نویسندگان

  • Kaixu Zhang
  • Can Wang
  • Maosong Sun
چکیده

Chinese word segmentation is a fundamental task for Chinese language processing. The granularity mismatch problem is the main cause of the errors. This paper showed that the binary tree representation can store outputs with different granularity. A binary tree based framework is also designed to overcome the granularity mismatch problem. There are two steps in this framework, namely tree building and tree pruning. The tree pruning step is specially designed to focus on the granularity problem. Previous work for Chinese word segmentation such as the sequence tagging can be easily employed in this framework. This framework can also provide quantitative error analysis methods. The experiments showed that after using a more sophisticated tree pruning function for a state-of-the-art conditional random field based baseline, the error reduction can be up to 20%.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Automatic Refinement of Syntactic Categories in Chinese Word Structures

Annotated word structures are useful for various Chinese NLP tasks, such as word segmentation, POS tagging and syntactic parsing. Chinese word structures are often represented by binary trees, the nodes of which are labeled with syntactic categories, due to the syntactic nature of Chinese word formation. It is desirable to refine the annotation by labeling nodes of word structure trees with mor...

متن کامل

Character-Level Dependency Model for Joint Word Segmentation, POS Tagging, and Dependency Parsing in Chinese

Recent work on joint word segmentation, POS (Part Of Speech) tagging, and dependency parsing in Chinese has two key problems: the first is that word segmentation based on character and dependency parsing based on word were not combined well in the transition-based framework, and the second is that the joint model suffers from the insufficiency of annotated corpus. In order to resolve the first ...

متن کامل

Chinese Word Segmentation for Terrorism-Related Contents

In order to analyze security and terrorism related content in Chinese, it is important to perform word segmentation on Chinese documents. There are many previous studies on Chinese word segmentation. The two major approaches are statistic-based and dictionary-based approaches. The pure statistic methods have lower precision, while the pure dictionary-based method cannot deal with new words and ...

متن کامل

Analyzing Chinese Synthetic Words with Tree-based Information and a Survey on Chinese Morphologically Derived Words

The lack of internal information of Chinese synthetic words has become a crucial problem for Chinese morphological analysis systems which will face various needs of segmentation standards for upper NLP applications in the future. In this paper, we first categorize Chinese synthetic words into several types according to their inside semantic and syntactic structure, and then propose a method to ...

متن کامل

A Maximum Entropy Chinese Character-Based Parser

The paper presents a maximum entropy Chinese character-based parser trained on the Chinese Treebank (“CTB” henceforth). Word-based parse trees in CTB are first converted into characterbased trees, where word-level part-ofspeech (POS) tags become constituent labels and character-level tags are derived from word-level POS tags. A maximum entropy parser is then trained on the character-based corpu...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • CoRR

دوره abs/1305.3981  شماره 

صفحات  -

تاریخ انتشار 2011